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Small over-represented motifs in biological networks often form essential functional units of 
biological processes. A natural question is to gauge whether a motif occurs abundantly or 
rarely in a biological network. Here we develop an accurate method to estimate the occur- 
rences of a motif in the entire network from noisy and incomplete data, and apply it to 
eukaryotic interactomes and cell-specific transcription factor regulatory networks. The 
number of triangles in the human interactome is about 194 times that in the Saccharomyces 
cerevisiae interactome. A strong positive linear correlation exists between the numbers of 
occurrences of triad and quadriad motifs in human cell-specific transcription factor regulatory 
networks. Our findings show that the proposed method is general and powerful for counting 
motifs and can be applied to any network regardless of its topological structure. 
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The increasing availability of genomic and proteomic data 
has propelled network biology to the frontier of biomedical 
research 1-4 . Network biology uses a graph to depict 
interactions between cellular components (proteins, genes, meta- 
bolites and so on), where the nodes are cellular components and 
the links represent interactions. Two of the most surprising 
discoveries from the genome sequencing projects are that the 
human gene repertoire is much smaller than had been expected, 
and that there are just over 200 genes unique to human beings 5 . 
As the number of genes alone does not fully characterize the 
biological complexity of living organisms, the scale of physio- 
logically relevant protein and gene interactions are now being 
investigated to understand the basic biological principles of life . 
Although the list of known protein-protein interactions (PPIs) 
and gene regulatory interactions (GRIs) is expanding at an 
ever-increasing pace, the human PPI and GRI networks are far 
from being complete and, hence, their dynamics have yet to be 
uncovered . 

The feed-forward loop (FFL) and several other graphlets (called 
motifs) are found to be over- represented in different biological 
networks 11 . Furthermore, over-represented motifs usually 
represent functional units of biological processes in cells. 
Hence, it is natural to ask whether a motif, such as a triangle, 
appears more often in the interactome of humans than in that of 
other species, or whether the FFL or the bi-fan appears more 
frequently in the human gene regulatory network. As the 
biological networks that have been reported are actually the 
subnetworks of the true ones and often contain remarkably many 
incorrect interactions for eukaryotic species, there are two 
approaches to answering these questions. One approach is to 
infer spurious and missing links in the entire network 12-14 , and 
then to count motif occurrences. Another approach is to estimate 
the number of motif occurrences in the interactome from the 
observed subnetwork data using the same method as that for 
estimating the size of eukaryotic interactomes 9 ' 10 ' 15 . If we have 
the number of occurrences of a motif or its estimate in a network, 
we can determine whether the motif is over-represented or not, 
based on how often the motif is seen in a random network with 
similar structural parameters 11 ' 16 ' 17 . 

In the present work, we take spurious and missing link errors 
into account to develop an unbiased and consistent estimator for 
the motif count. The method works for both undirected and 
directed networks. We derive explicit mathematical expressions 
for the estimators of five commonly studied triad and quadriad 
network motifs (Fig. 1). These estimators are further validated 
extensively for each of the following four models: Erdos-Renyi 
(ER) 18 , preferential attachment 19 , duplication 20 and the geometric 
model 21 (Supplementary Note 1). By applying the method to 
eukaryotic interactomes, we find that the number of triangles in 
the human interactome is about 194 times that of the 
Saccharomyces cerevisiae interactome, three times as large as 
expected. By applying the method to human cell- specific 
transcription factor (TF) regulatory networks 22 , we discover a 
strong positive linear correlation between the counts of widely 
studied triads and quadriads. We also notice that the embryonic 
stem cell's TF regulatory network has the smallest number of 
occurrences relative to its network size for all the five motifs 
under study. 
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Figure 1 | Network motifs found in biological networks. The feed-forward 
loop, bi-fan and biparallel are over-represented, whereas feedback loop 
is under-represented in gene regulatory networks and neuronal connectivity 
networks 11 . 



V= {1,2,3,..., «}. Let £ obs (y obs , £ obs ) be an observed subnetwork 
of Q. Following (ref. 9), we model an observed subnetwork as the 
outcome of a uniform node sampling process in the following 
sense. Let X { be independent and identically distributed Bernoulli 
random variables with the parameter p e (0,1] for i = 1,2, . . . ,n. We 
use Xi — 1 and X t — 0 to denote the events that node i is sampled 
and not sampled, respectively. Then V° bs is the set of nodes i with 
Xi=l, and £ obs is induced from E by V° bs . For clarity of 
presentation, we first introduce our method for the case when the 
observed subnetwork is free from experimental errors, and then 
generalize it to handle noisy observed subnetwork data. 

Consider a motif M. We use N M and Nfy s to denote the 
number of occurrences of M in Q and C/ obs , respectively. We 
assume that the number of nodes, n, is known, but only links in 
G obs are known. We are interested in estimating N M from the 
observed subnetwork C/ obs . As C/ obs is assumed to be free from 
experimental errors, we can obtain Nfy s simply by enumeration. 
Let us define the following: 



„obs 



AT 



M ' 



(i) 



where m and n obs are the number of nodes in M and Q° , 
respectively. 

Let A= [fly]i<i, ; < n denote the adjacency matrix of Q, where 
fly = 1 if there is a link from i to j, and — 0 otherwise. 
Furthermore, for a subset /^{l,2,...,n}, A[/] denotes the 
submatrix consisting of entries in the rows and columns indexed 
by /. We write N M as a function of A and Nfy s as a function of A 
and the random variables X t . We also assume the following: 



N M = fM(Mh: 



*2,- 



»])> 



(2) 



ti < i 2 < ■■■ < i m 



Results 

Estimating motif occurrences. In this study, we shall consider 
PPIs and gene regulatory networks. The former are undirected, 
whereas the latter are directed networks. Consider a directed or 
undirected network Q(V, £), where V is the set of nodes and E is 
the set of links. For simplicity, we assume that Q has n nodes and 

2 



N M = fM(A[i u i 2 , i m ])X h X h ...X im , (3) 

z'i < i 2 < ■■■ < i m 

where f M {) is a function chosen to decide whether M occurs 
among nodes /!,/ 2 ,...,/ m or not. For the motifs listed in Table 1, 
their corresponding functions f M Q are given in Supplementary 
Table SI. 
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Table 1 | Bias-corrected estimators for 14 motifs. 

Motif Bias-corrected estimator 
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n (n obs ), the number of nodes in the entire network (respectively, the observed subnetwork). 
m„ the number of nodes in motifs of type-/'. 

N° bs , the number of occurrences of motifs of type-/' observed in the subnetwork data. 
r=1 — r_ — r+. 



From equations (1) and (3), we have 



Hence, by equation (2), 



£(Nm) = 



^ /^(Aliijia,-, im])xE 



l<iy < i 2 < ... < i m <n 



X^Xi 2 ■ ■ ■ Xj m 
\ \ m 



where a random variable such that 



„ obs =X 1 +X 2 + ••• +X„. 



(4) 



E(N M )=[ m )N M E 



( \ 

XiX 2 • • • X 1 

obs 



\ \ m J J 
— n(n — 1) ■■■(« — m + 1)N^ 



As the random variables X t are independent and identically 
distributed, for any \ <i 1 <i 2 < ... <i m < n, we also have 



/ \ 

obs 



\ \ m ) J 



= E 



( \ 

X\X 2 • • • x m 

( n ohs \ 
V \ m J J 



xE 



X\X 2 • • • x n 



By conditioning on the event that X x — X 2 
rewrite equation (4) as 



n obs = Z + m, 



n ohs (n ohs -!)••• (n obs - m + 1), 

• • —X m — 1, we 
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where Z~Binomial(n — ra,p), and hence 

=n(n-l) • • • (n-m + l)p n 



xE 



(Z + m)(Z + m-l) • • • (Z+r 



As 
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-u du 
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Jo 



(1- 



(m- 1) ! 



-E(u z )du, 



we have 



(5) 



by applying integration by parts and simplification. Therefore, we 
have obtained the following theorem. 

Theorem 1: Let Q be a network of n nodes. Assume C/ obs is a 
subnetwork of Q obtained by a uniform node sampling process 
that selects a node with^probability p. For any motif M of m 
nodes, the estimator Nm^ denned in equation (1) satisfies 
equation (5). Therefore, Nm is an ^asymptotically unbiased 
estimator for N M in the sense that E(Nm/Nm) —> 1 as n goes 
to infinity. Moreover, the convergence is exponentially fast in n. 

When the estimator (1) is applied to estimate the number of 
links in an undirected network Q, the variance has the following 
closed-form expression: 



Var 



= (2qN 2 



V2 1-p* 



(l + 0(n" 1 )) + 0(n- 1 ), 



where Ni and N 2 are, respectively, the number of links and three- 
node paths in Q (Supplementary Methods). This leads to our next 
theorem. 

Theorem 2: When Q is generated from one of the ER, preferential 
attachment, duplication or geometric models, Var(Ni/IVi) —> 0 as 
n goes to infinity. ^ 

Theorem 2 says that N\ is consistent. For an arbitrary motif, 
it is much more complicated to derive the variance of the 
estimator (1). Nevertheless, our simulation shows that for all the 
motifs in Fig. 1, the variance of the estimator converges to zero as 
n goes to infinity and, hence, it is consistent (Fig. 2 and 
Supplementary Figs S1-S8). We wish to point out that the 
notions 'asymptotically unbiased' and 'consistent' are not used in 
the usual statistical sense where the population is fixed and the 
number of observations increases to infinity. 

For realistic estimation, one has to take error rates into 
account, as detecting PPIs or GRIs is error prone to some degree. 
PPIs or gene regulatory networks have spurious interactions (that 
is, false positives) and missing interactions (that is, false negatives). 
We define the false-positive rate r + to be the probability that a 
non-existing link is incorrectly reported, and the false-negative 
rate r_ to be the probability that a link is not observed. Using the 
independent random variables F^ 2 ~ Bernoulli(r + ) and F-~- 2 ~ 
Bernoulli(r_ ) to model spurious and missing interactions in 
the observed subnetwork C/ obs , we can represent the effect of 
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Figure 2 | Plots of MSE(N F fl) and MSE(N F fl) for counting the 
occurrences of FFL The random networks of n nodes and edge density p 
are generated from the preferential attachment model. Both MSE(/Vffl) and 
MSE(/Vffl) depend on n, p and the node sampling probability p. MSE(Nffl) 
also depends on the link error rates r_ and r + . (a) MSE(/V F fl) changes with 
n and p when p = 0.1. (b) MSE(/V F fl) changes with n and p when p = 0.1, 
r_ =0.85 and r + =0.00001. (c) MSE(/V FFL ) and MSE(/V FFL ) change with p 
when n = 5,000, p = 0.1, r_ =0.85 and r + =0.00001. (d) MSE(N F fl) 
changes with r + and r_ when n = 5,000, p = 0.1 and p = 0.1. 



experimental errors on an ordered pair of nodes (/i,/ 2 ) as 

a hh = a hh {\ - Fr h ) + (1 - a hh )F± 2 . (6) 

In other words, for any two nodes i'i,/ 2 eV° bs , a link (fi,i 2 ) is 
observed in the subnetwork C/ obs (that is, a ili2 = 1) if (i) there is a 
link (/i,/ 2 ) in the real network Q (that is, a iv i 2 — 1) and there is no 
false negative (that is, — 0) or (ii) the link (/i,/ 2 ) does not exist 
in the real network Q (that is, a ix / 2 = 0) but a false positive occurs 
(that is, F+ h = 1). 

To take error rates into account, we simply replace each entry 
Ojij2 in the adjacency matrix A with to obtain a new matrix, 
A, and then replace A with A in equation (3). For any motif M in 
Table 1, the expectation of the estimator Nm in equation (1) can 
be expressed as (Supplementary Methods) 



E(N 



M) 



!-e (;w- j 1 hid 



> s Nm + W m ], 



where s is the number of links that M has and Wm is a function 
of n, r_, r + , and N M > for all proper submotifs M' of M 
(Supplementary Table S2)^Thus, to correct the bias caused by 
link errors, we derive Wm from Wm by replacing Nm' 
with N M > for all submotifs of and use the following formula 
to estimate N^: 



1 



(1- 



(Nm-Wm). 



(7) 



For the motifs listed in Fig. 1, the corresponding bias-corrected 
estimators are given in Table 1. 

We examined the accuracy of the proposed estimators on 
networks generated by a random network model. As these 
estimators are asymptotically unbiased, we used the mean square 
error (MSE) of the ratios Nm/Nm and Nm/Nm> defined later in 
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equation (9), to measure their accuracy (see Methods section). 
Figure 2 summarizes the simulation results for the FFL motif in 
random networks generated from the preferential attachment 
model 19 . (The results for other motif network model 
combinations are similar and can be found in Supplementary 
Figs S1-S8.) First, when the edge density p is fixed, the MSE of 
the estimators for FFL decreases and converges to zero as n 
increases (Fig. 2a,b). Second, the MSE decreases as the edge 
density increases, suggesting that the estimators are even more 
accurate when applied to less sparse networks. Third, the MSE of 
the estimators decreases as p increases (Fig. 2c). Finally, the MSE 
increases with r_ and r + (Fig. 2d). Altogether, our simulation 
tests confirm that the proposed estimators are accurate for any 
underlying network. 



Motif richness in the human interactome. The entire inter- 
actomes for eukaryotic model organisms such as S. cerevisiae, 
Caenorhabditis elegans, Homo sapiens and Arabidopsis thaliana 
are not fully known. We estimated the interactome size (that is, 
the number of interactions) and the number of triangles in the 
entire PPI network for S. cerevisiae, C. elegans, H. sapiens and 
A. thaliana, using the data sets CCSB-YI1 (ref. 23), CCSB-WI- 
2007 (ref. 24), CCSB-HI1 (refs 25,26) and CCSB-AIl-Main 27 . 
These data sets were produced from yeast two-hybrid 
experiments and their quality parameters are summarized in 
Table 2 for convenience. 

First, we re- estimated the size of four interactomes using the 
bias-corrected estimator N\ (Table 1). To test all possible 
interactions between selected proteins, the sets of bait and prey 
proteins should be exchanged in the two rounds of interaction 
mating in a high-throughput yeast two-hybrid experiment 28 . 
However, this is only true for the C. elegans and H. sapiens data 
sets (CCSB-WI-2007 and CCSB-HI1, respectively). For the 
S. cerevisiae and A. thaliana data sets (CCSB-YI1 and CCSB- 
AIl-Main, respectively), the set of bait proteins are slightly 
different from the set of prey proteins. For these two cases, we 
applied our estimator to the subnetwork induced by the 
intersection of the bait and prey protein sets. 



The following estimator was proposed by Stumpf et al. 9 for the 
size of an interactome and was later used to estimate the size of 
the eukaryotic interactomes 23 ' 24 ' 26 ' 27 : 



(No. of observed interactions) x Precision 
Completeness x Sensitivity 



(8) 



where 'completeness' is the fraction of all possible pairwise 
protein combinations that have been tested. In our notation, 

(No. of observed interactions) =N° hs , 

Sensitivity = 1 — r_ , 

Precision = l — r d , 

Completeness = ^ n ^ ^ ^ ^ 2 ) ' wnere r d is the proportion of 

spurious links among detected links and is called the false 
discovery rate. (Note that r d was called the false-positive rate in 
ref. 9.) Thus, the estimator (8) becomes 



.obs 



obs 



\ 



^obs 



obs 
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For PPI data sets, r + is about 10 ~ 4 and^ thus 1 — r_ ^ 
1 — r_ — r + . As r d is also small, our estimator N\ handles errors 
differently but is quite close to the estimator (8). In particular, 
when the precision is 100% or, equivalently, when r d — r + =0, 
these two estimators are equal (Supplementary Note 2 and 
Supplementary Fig. S9). Indeed, our estimates for interactome 
size agree well with those obtained from equation (8) (Table 2). 
Such an agreement demonstrates again that our estimators for 
counting motifs are accurate. 

We proceed further to estimate the number of triangles in each 
of the interactomes using the corresponding bias-corrected 
estimator N 3 in Table 1. For each interactome, we estimated 
the number of triangles from the observed subnetwork data 
directly and from sampling the observed subnetwork repeatedly. 
The two estimates agree well (Table 2). 

Our estimation shows that although the size of the A. thaliana 
interactome is about 1.8 times that of the human interactome, it 



Table 2 | The interactome size and the number of triangles in the PPI networks of four species in our study. 





S. cerevisiae 


C. eiegans 


H. sapiens 


A. thaliana 


Total no. of proteins 

No. of proteins screened* 

No. of links detected* 


6,000 
3,676 
967 


20,065 
9,906 
1,816 


22,500 
7,194 
2,754 


27,029 
7,108 
4,890 


Quality parameters* 

Precision"'' 

Sensitivity 

False-negative rate (r_) 
False positive rate (r+) 


0.9400 
0.1700 
0.8300 
0.8x10" 5 


0.8600 
0.0496 
0.9504 
0.5x10" 5 


0.7940 
0.0950 
0.9050 

2x10" 5 


0.8030 
0.1570 
0.8430 

3x10" 5 


Interactome size 
CCSB estimate* 
Our estimate^ 
Mean ± s.d § 
Link density 


18,000 ±4,500 

14,000 
15,000 ±2,700 
8x10" 4 


116,000 ±26,400 

121,000 
122,000 ±16,600 
6x10" 4 


130,000 ±32,600 

210,000 
214,000 ±32,200 

8 x10" 4 


299,000 ±79,200 

377,000 
376,000 ±45,600 
10 x10" 4 


No. of triangles 
Our estimate^ 
Mean ± s.d.§ 
Triangle density 


53,000 
61,000 ±33,800 
1 x10" 6 


6,263,000 
5,971,000 ±3,593,800 
5x10" 6 


10,270,000 
11,255,000 ±4,717,100 
5x10" 6 


10,697,000 
10,158,000 ±4,289,000 
3 x10" 6 



CCSB, Center for Cancer Systems Biology; PPI, protein-protein interaction. 

*Reported in refs 23-27. 

f False discovery rate = 1 — precision. 

lEstimates have been calculated from the observed PPI subnetworks. 

§Mean and s.d. of the estimates have been calculated by sampling 100 sub-data sets from the observed subnetwork data using the node sampling probability 0.1. 
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contains fewer triangles than the human interactome does. The 
triangle density of the human and C. elegans inter actomes are 
similar and are 1.7 times that of the A. thaliana and 5 times that 
of S. cerevisiae. The size of the human interactome is only 
15 times that of the S. cerevisiae interactome, yet the number of 
triangles in the former is about 194 times that in the latter, 
3 times as large as expected. 

Correlation between motif counts in TF regulatory networks. 

Recently, the TF regulatory networks of 41 human cell and tissue 
types were obtained from genome-wide in vivo DNasel footprints 
map 22 . In these networks, the nodes are 475 TFs and the 
regulation of each TF by another is represented by network- 
directed links. Motif count analysis showed that the distribution of 



the motif count is unimodal, with the peak corresponding to the 
mean value for each motif (diagonal panels in Fig. 3). Surprisingly, 
there is a very strong linear correlation between the counts in the 
TF regulatory networks of different cell types, even for the triad 
and quadriad motifs that are topologically very different (Fig. 3). 

Given that human has about 2,886 TF proteins 29 , we further 
estimated the number of occurrences of the 5 motifs for each of 
the 7 functionally related classes of cells (Fig. 3 and Table 3). This 
was achieved by simply setting the false-positive and -negative 
rates to 0, as they are currently unknown. The TF regulatory 
networks of blood cells have diverse motif counts. Specifically, 
for all triad and quadriad motifs, the promyelocyte leukemia cell 
TF regulatory network has the largest number of occurrences, 
whereas the erythoid cell TF regulatory network has the smallest 
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Figure 3 | Correlation of motif counts in 41 human cell-specific TF regulatory networks. The upper triangular panels are scatter plots of the counts of the 
5 motifs in the TF regulatory networks of one embryonic stem cell (black), 7 blood cell types (red), 2 cancer cell types (green) and 31 other cell and 
tissues types (grey) 22 . Here the x and y axes represent the estimated counts of the two corresponding motifs. Each diagonal panel shows the distribution of 
these motifs' occurrences, in which the x and y axes represent the estimated motif count and the number of TF regulatory networks, respectively. The 
correlation coefficients of the motifs' occurrences are given in the lower triangular panels. 



Table 3 | The estimated network size and count of triad and quadriad motifs in seven cell classes. 

No. of links No. of feedback loop No. of FFL No. of biparallel 


No. of bi-fan 


Epithelia 


344 ± 59* 


1,896 ±844 


19,901 ±8,419 


1,858,957 ±1,013,756 


3,238,587 ±1,618,601 


Stroma 


412 ±38 


2,727 ±705 


29,155 ±6,290 


3,052,803 ±883,160 


5,094,576 ±1,337,401 


Blood 


434 ± 97 


3,687 ±1,699 


37,884 ±15,241 


4,379,527 ± 2,320,472 


7,359,970 ± 3,421,025 


Endothelia 


447 ± 40 


3,160 ±695 


35,314 ±6,567 


3,844,161 ±948,207 


6,877,606 ±1,540,212 


Cancer 


380 ±7 


2,378 ±111 


30,122 ±710 


2,862,215 ±91,628 


6,267,987 ± 99,444 


Fetal cells 


426 ±70 


3,088 ±998 


33,782 ±9,955 


3,660,840 ±1,500,838 


6,498,027 ± 2,284,014 


ES cell 1 " 


485 


2,766 


32,400 


3,282,473 


6,436,708 


ES, embryonic stem; FFL, feed-forward loop; TF, transcription factor. 

*The motif count for each group is presented in the form mean±s.d., and the numbers are presented in thousands. 
tThere is only one ES cell TF regulatory network. 
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number of occurrences. The embryonic stem cell TF regulatory 
network has the smallest number of occurrences relative to its 
network size for all the motifs. 

In a random network, the ratio of the FFL count to the 
feedback loop count is ~ 3:1. However, in the human cell-specific 
TF regulatory networks, the ratio is about 10:1, suggesting FFL is 
significantly enriched in these networks. Table 3 also suggests that 
the bi-fan motif is relatively abundant in these networks, as the 
ratio of the bi-fan count to the biparallel count is roughly 1:2 in a 
random network. 

Discussion 

By taking spurious and missing link rates into account, we have 
developed a powerful method for estimating the number of motif 
occurrences in the entire network from noisy and incomplete data 
for the first time. It extends previous studies on interactome size 
estimation 9 ' 10 ' 23-27 to motif count estimation in a directed or 
undirected network. Such a method is important because exact 
motif enumeration is possible only if the network is completely 
known, which is often not the case in biology. Our proposed 
method has been proven mathematically as being unbiased and 
accurate without any assumption at all regarding the topological 
structure of the underlying networks. Therefore, our proposed 
estimators can be applied to all the widely studied networks in 
social, biological and physical sciences. 

Interactome size has been estimated from noisy subnetwork 
data by using equation (8), where the precision (which is 1 — r d ) and 
sensitivity of the data are taken into account 23,24,26 ' 27 . This approach 
might yield an inaccurate estimate, as the false discovery rate is often 
calculated from gold-standard data sets 30-33 and can be quite 
unreliable, as indicated in ref. 26, in which the false discovery rate 
for the data set CCSB-HI1 was adjusted from 87% to 93%, to 20.6%, 
after multiple cross- assay validation. By contrast, our proposed 
method uses false-positive and false-negative rates for motif count 
estimation. As the false-negative rate is equal to 1 — sensitivity and 
the false-positive rate is only about 10 ~ 4 , our method is more 
robust than estimations based on the false discovery rate. 



Theorems 1 and 2 in the present paper show that motif 
counting via sampling and then scaling up in a huge network is 
not merely fast but can also give accurate estimate. Take the 
triangle motif, for instance. In our validation test, the 
equation (l)-based sampling achieved less than 1% deviation 
from the actual count by using no more than 50% of the 
computing time compared with the naive triangle counting 
method (Fig. 4 and Supplementary Note 3). As the obtained 
sampling approach takes positive and negative link-error rates 
into account, it is a good addition to the methodology for 
estimating motif count in networks 34,35 . 

By applying our estimation method to PPI subnetwork data for 
four eukaryotic organisms, we found that the numbers of 
triangles in a eukaryotic interactome differ considerably. For 
example, the triangle motif is exceptionally enriched in the 
human interactome. As noted in ref. 9, we have to keep in mind 
that the estimates in Table 2 are based on PPIs that are detectable, 
given current experimental methods. However, our estimators 
will remain correct for any interaction data available in the future. 

We also discovered that there is a very strong positive linear 
correlation between triad and quadriad motif occurrences in human 
cell-specific TF regulatory networks, and that the TF regulatory 
network of embryonic stem cells has the smallest number of 
occurrences relative to its network size for each of the common 
triad and quadriad motifs. Hence, our study reveals a surprising 
feature of the TF regulatory network of embryonic stem cells. 

Finally, we remark that the proposed estimators for motif 
counting are derived using the assumption that the subnetwork 
data is the outcome of a uniform node sampling process. In 
practice, however, biologists may select proteins for study 
according to their biological importance. The accuracy of our 
proposed method was examined for a degree-bias and two other 
non-uniform node sampling schemes (Supplementary Note 4 and 
Supplementary Figs S10-S12). In the degree-bias sampling 
process, a network node is sampled independently with a 
probability that is proportional to its degree in the underlying 
network. By the nature of this sampling process, it leads to 




Figure 4 | Computational time efficiency of the proposed sampling approach. The simulation test was conducted on a network of 5,000 nodes with the 
edge density 0.1. The computational time efficiency of the sampling approach is defined as the ratio of the time taken by our approach to the time 
used by the direct counting approach, and MSE is defined in equation (9). (a) Computational time efficiency versus the MSE for four values of the node 
sampling probability p. When p = 0.1, 0.2, 0.3 and 0.4, the number of repetitions was set to 125/c, 25k, 5k and 2k (1</c<8), respectively, (b) When 
the node sampling probability p is fixed, the computational time efficiency increases as a linear function of the number of repetitions, (c) When the number 
of repetitions ( rep) is fixed, the computational time efficiency increases as a cubic function of p. (d) MSE decreases as the number of repetitions 
increases, (e) MSE decreases as p increases. 
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overestimation of motif count when our proposed estimator is 
used. Our simulation tests indicate that its effect on the 
estimation of motif count depends on the scale-free structure of 
the underlying network and the proportion of the sampled nodes. 
In particular, when more than 60% of nodes in a network are 
sampled, the estimate is no more than five times the actual count. 
Hence, the triangle counts in the four eukaryotic interactomes are 
likely less than the estimates listed in Table 2 by a small constant 
factor. How to correct the overestimation caused by a degree-bias 
node sampling is challenging and worthy to study in future. 

Methods 

Interaction data. Human, yeast, worm and A. thaliana PPI data sets were down- 
loaded from the Center for Cancer Systems Biology (CCSB) (http://ccsb.dfci.- 
harvard.edu): CCSB-YI1 (ref. 23), CCSB-WI-2007 (ref. 24), CCSB-HI1 (refs 25,26 
and CCSB-AIl-Main 27 . TF regulatory interaction data sets were downloaded from 
the Supplementary Information of ref. 22 in the Cell journal website. 

Simulation validation for motif estimators. We considered four widely used 
random graph models: ER 18 , preferential attachment 19 , duplication 20 and 
geometric models 21 (Supplementary Note 1). Using each model, we generated 
200 random networks by using different combinations of node number 
ne {500,1,000,1,500,..., 10,000} and edge density pe {0.01,0.02,..., 0.1}. Each 
generated network was taken as the whole network Q, from which 100 subnetworks 
were sampled using the node sampling probability pe{0.05,0. 1,0. 15,. ..,0.5}. For 
each motif M appearing in Fig. 1, we first computed Nm (given in equation (1)) 
from the motif count in each sampled subnetwork. This was used as an estimate of 
the number of occurrences of the motif in the error-free case, Nm- Spurious and 
missing interactions were then planted in the sampled^ subnetworks with the chosen 
error rates r + and r_ . The bias-corrected estimator Nj^ (given in Table 1) for Nm 
was then recalculated. We used the MSE of the ratios Nm/Nm an d Nm/Nm to 
measure the consistency (and hence accuracy) of Nm and Nm > respectively. 

For the estimator Y of a parameter 6, the MSE of Y in estimating 6 is defined as 

MSE(Y)=E((Y-0) 2 ). 
This expression can be used to measure the MSE made in the estimation. In our 
validation test, we sampled 100 subnetworks from a network Q to evaluate the 
consistency of the estimator Nm of a motif M. As E(Nm/Nm) approaches to 1 
when n is large (Theorem 1), the MSE(Nm/Nm) was approximately computed as 



MSE 



Nm 
N m 



1 

Too, 



N M ,i 
N M 



(9) 



where Nmj is the estimate calculated from the i th subnetwork using Nm> 
1 < i< 100. Computing MSE(N M /N M ) is similar. 
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